DOCUMENT RESUME

ED 281 859                                        TM 870 238

AUTHOR        Linacre, John M.; Wright, Benjamin D.
TITLE         Item Bias: Mantel-Haenszel and the Rasch Model.
              Memorandum No. 39.
INSTITUTION   Finnish Association of Mathematics and Science
              Education Research.
PUB DATE      Feb 87
NOTE          17p.
PUB TYPE      Reports - Research/Technical (143)

EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   Algorithms; Estimation (Mathematics); *Item Analysis;
              Measurement Techniques; *Measures (Individuals);
              Predictive Measurement; *Test Bias; *Test Items;
              *Test Theory
IDENTIFIERS   *Mantel Haenszel Procedure; *Rasch Model



ABSTRACT 

The Mantel-Haenszel (MH) procedure attempts to identify and
quantify differential item performance (item bias). This paper
summarizes the MH statistics, and identifies the parameters they
estimate. An equivalent procedure based on the Rasch model is
described. The theoretical properties of the two approaches are
compared and shown to require the same assumptions. The MH
procedure is shown to be statistically inferior to the Rasch
procedure.

Introduction



The identification and quantification of differences in
item performance for contrasting groups of examinees is important
if differences between groups are to be understood, and tests
equivalent for groups are to be designed and maintained.

Description of the Mantel-Haenszel procedure

Mantel & Haenszel (1959) discuss several statistical techniques
for determining relative risk of disease occurring in individuals
with regard to the presence or absence of other factors. One
approach is to divide the samples under investigation into
diseased and disease-free groups, and then match sub-categories
of these groups according to the presence or absence of factors.

In their discussion of what is now referred to as the
Mantel-Haenszel (MH) procedure, they explain that their
intention is to address the problem of determining overall
relative risk of disease as a weighted average of the relative
risks in the presence or absence of various factors, with the
proviso that factors which affect the risk in an extreme way are
not encountered.

The MH procedure has since been proposed as an approach to
detecting differences in item performance between groups
differing in some other way.

After test administration, the first step of the MH procedure
is to identify two examinee groups. These are the reference
group, R, (chosen to provide the standard performance on the
item of interest), and the focal group, F, whose differential
performance, if any, is to be detected and measured. (The
formulation and terminology in this paper come from Holland &
Thayer (1986)). These two groups correspond to the
"disease-free" and "diseased" groups.

However, there are seldom any clear external categorizing
factors, so implied levels of ability are hypothesized as
factors. The ability range of the groups is usually divided into
three to five intervals, and these intervals are used to match
samples from each group. Matching can be based on whatever
information is available, which usually includes examinees'
scores on the test of which the item in question is a part.

For each ability interval, of which there are now K, a 2x2 table
is constructed from the responses by examinees in each sample in
that interval to the target item. This table of responses made
by the two sample groups in the j-th ability interval has the
form shown in Figure 1.



Sample Group              Answer made:               Total in
                          Right (1)    Wrong (0)     sample

Reference Group (R)       A_j          B_j           N_Rj
Focal Group (F)           C_j          D_j           N_Fj

Combined Groups           M_1j         M_0j          T_j

Figure 1. Data for the matched set of members of
R, the reference group, and F, the focus group.

The MH procedure is based on estimates of the probability of a
member of the reference group in interval j getting the item
right ($P_{Rj}$), or getting it wrong ($Q_{Rj}$), and similarly for a
member of the focal group ($P_{Fj}$) and ($Q_{Fj}$).

Two statistics are derived from these estimates:

1. An estimate, $\hat{\alpha}$, of the difference in performance between the
two groups across all intervals. This is an estimate of the
parameter, $\alpha$, which will satisfy

$$ P_{Rj}/Q_{Rj} = \alpha \times (P_{Fj}/Q_{Fj}), \quad j = 1, \ldots, K \tag{1} $$

This $\alpha$ is that common odds-ratio of the two groups which is
shared by each of the K 2x2 tables.

The MH equation for this performance difference estimator is

$$ \hat{\alpha} = \frac{\sum_j A_j D_j / T_j}{\sum_j B_j C_j / T_j}, \quad j = 1, \ldots, K \tag{2} $$

in which $\hat{\alpha}$ has the range 0 to infinity, with no differential
performance (the null value) represented by 1.

A transformation of this statistic is proposed by Holland and
Thayer to create a symmetric scale with null value of zero. This
"delta scale" value is obtained by

$$ \Delta = -(4/1.7) \times \ln(\hat{\alpha}) \tag{3} $$






According to Holland and Thayer, a proper standard error for this
estimate is not yet determined, though much work has been done in
this area. They do include an approximation which is dependent
on the number and nature of responses in each ability level and
the size of the $\hat{\alpha}$ estimate.

2. An estimate of the statistical significance of the difference
between the performance levels of reference and focal groups.
This is a chi-square statistic with 1 degree of freedom, which,
omitting correction for continuity, is

$$ \mathrm{CHISQ} = \frac{\left(\sum_j A_j - \sum_j E(A_j)\right)^2}{\sum_j \mathrm{Var}(A_j)}, \quad j = 1, \ldots, K \tag{4} $$

where

$$ E(A_j) = N_{Rj} M_{1j} / T_j \tag{5} $$

and

$$ \mathrm{Var}(A_j) = \frac{N_{Rj} N_{Fj} M_{1j} M_{0j}}{T_j^2 (T_j - 1)} \tag{6} $$
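The two MH statistics above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `mantel_haenszel` and the example counts are hypothetical, the expectation $E(A_j) = N_{Rj} M_{1j} / T_j$ is the usual hypergeometric one, and no continuity correction is applied.

```python
import math

def mantel_haenszel(tables):
    """Compute the MH estimate, its delta-scale value and the chi-square.

    tables: list of (A, B, C, D) counts, one tuple per ability interval,
    where A, B = reference group right, wrong and C, D = focal group
    right, wrong (the cell labels of Figure 1).
    """
    num = den = 0.0
    sum_a = sum_e = sum_var = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        n_r, n_f = a + b, c + d      # group sizes in this interval
        m_1, m_0 = a + c, b + d      # total right, total wrong
        num += a * d / t
        den += b * c / t
        sum_a += a
        sum_e += n_r * m_1 / t                                  # E(A_j)
        sum_var += n_r * n_f * m_1 * m_0 / (t * t * (t - 1))    # Var(A_j)
    alpha = num / den                          # equation (2)
    delta = -(4 / 1.7) * math.log(alpha)       # equation (3)
    chisq = (sum_a - sum_e) ** 2 / sum_var     # equation (4), uncorrected
    return alpha, delta, chisq

# Hypothetical counts for K = 3 ability intervals (not data from the paper).
example = [(20, 30, 10, 40), (30, 20, 20, 30), (40, 10, 30, 20)]
alpha, delta, chisq = mantel_haenszel(example)
```

The sketch does not guard against an interval with no discordant cells (a zero denominator) or with fewer than two examinees; real data would need those checks.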



The MH procedure and its application to problems in the
medical sphere are further discussed in Fleiss (1973)
pp. 117-118 and Bishop, Fienberg, Holland (1975) pp. 146-149.

What does the MH difference statistic estimate?



The practical application of the MH procedure requires an
understanding of what these statistics estimate.

Consider the $\hat{\alpha}$ estimated by

$$ \hat{\alpha} = \frac{\sum_j A_j D_j / T_j}{\sum_j B_j C_j / T_j}, \quad j = 1, \ldots, K \tag{7} $$

This estimates the parameter $\alpha$ which fulfills

$$ P_{Rj}/Q_{Rj} = \alpha \times (P_{Fj}/Q_{Fj}), \quad j = 1, \ldots, K \tag{8} $$

where each j corresponds to an ability level.

As the number of ability levels is arbitrary, if this $\alpha$ is to
have meaning beyond the particular matching scheme used, it must
be independent of the number of levels chosen. It must also be
independent of the number of pairs of examinees in each interval.
In particular, it must satisfy the equation when the number of
intervals is constructed to be the same as the number of pairs of
examinees, with one pair of examinees in each interval.

Consequently, reformulating and taking logarithms, $\alpha$ must satisfy

$$ \ln(\alpha) = \ln(P_R/Q_R) - \ln(P_F/Q_F) \tag{9} $$



for any and all pairs of examinees matched by ability.

Differential item performance

The Rasch model hypothesizes that each examinee has an ability, B,
and each item has a difficulty, D. If there is differential item
performance, the difficulty for the reference group, $D_R$, will be
different from the difficulty for the focus group, $D_F$.

The items on the test other than the suspect items can be used to
determine ability estimates for the sample members of both
groups, (and item difficulties for all non-suspect items, if
desired), according to Rasch model specifications. Procedures
for performing this analysis are described in Wright & Stone
(1979).

This analysis yields an ability estimate b of the ability
parameter B for each examinee in each group on a common interval
scale. Then, by examining performance on the suspect item,
Rasch estimates are obtained for parameters which satisfy,
for each member of the reference sample group,

$$ B - D_R = \ln(P_R/Q_R) \tag{10} $$

and, for each member of the focus sample group,

$$ B - D_F = \ln(P_F/Q_F) \tag{11} $$

Thus, for each pair of examinees who are matched on ability,

$$ D_F - D_R = \ln(P_R/Q_R) - \ln(P_F/Q_F) = \ln(\alpha) \tag{12} $$

which is the formulation derived above for the $\ln(\alpha)$ MH
parameter. Since the B parameters cancel, this Rasch evaluation
is independent of the distribution of abilities.
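The cancellation of B can be checked numerically. The sketch below assumes the dichotomous Rasch model, $P = 1/(1 + e^{-(B - D)})$, with hypothetical difficulties for the two groups, and evaluates $\ln(P_R/Q_R) - \ln(P_F/Q_F)$ for matched pairs at several ability levels.

```python
import math

def log_odds(ability, difficulty):
    """Rasch log-odds of a right answer, ln(P/Q), computed from P itself."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return math.log(p / (1.0 - p))

# Hypothetical difficulties D_R and D_F in logits (not from the paper).
d_ref, d_foc = 0.0, 0.5

# ln(alpha) for a matched pair of examinees at each of three ability levels.
ln_alphas = [log_odds(b, d_ref) - log_odds(b, d_foc) for b in (-2.0, 0.0, 1.5)]
```

Every entry of `ln_alphas` equals `d_foc - d_ref` up to rounding, whatever ability the matched pair shares.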



Approximation to the item bias for each interval

The item difficulty of the suspect item for the matched reference
and focus groups in each interval may be estimated by the normal
approximation algorithm (PROX) (Wright & Stone, 1979, Chapter 2).

For the reference group, the algorithm is

$$ d_{Rj} = M_{Rj} + X_{Rj} \times \ln(B_j/A_j) \tag{13} $$

with error variance

$$ s^2_{Rj} = X^2_{Rj} (A_j + B_j) / (A_j B_j) \tag{14} $$

where

$M_{Rj}$ is the mean ability of the reference group in
interval j,

$X_{Rj}$ is a correction factor for the distribution of
abilities in interval j,

and where

$$ X_{Rj} = \sqrt{1 + s^2_{BRj}/2.89} \tag{15} $$

with $s^2_{BRj}$ as the ability variance of the
reference group in interval j.



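For the focus group, the algorithm is similarly

$$ d_{Fj} = M_{Fj} + X_{Fj} \times \ln(D_j/C_j) \tag{16} $$

with error variance

$$ s^2_{Fj} = X^2_{Fj} (D_j + C_j) / (D_j C_j) \tag{17} $$

where

$M_{Fj}$ is the mean ability of the focus group in
interval j,

$X_{Fj}$ is a correction factor for the distribution of
abilities in interval j,

and where

$$ X_{Fj} = \sqrt{1 + s^2_{BFj}/2.89} \tag{18} $$

with $s^2_{BFj}$ as the ability variance of the
focus group in interval j.

So the item bias in interval j can be estimated from

$$ \frac{d_{Fj} - M_{Fj}}{X_{Fj}} - \frac{d_{Rj} - M_{Rj}}{X_{Rj}} = \ln\left(\frac{A_j D_j}{B_j C_j}\right) \tag{19} $$

which is $\ln(\alpha_j)$ for the j-th interval.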

Thus, if $M_{Rj}$ equals $M_{Fj}$ (the "matched" groups have equal means)
and $X_{Rj}$ equals $X_{Fj}$ (the "matched" groups have equal variances),
then, when matching the j-th interval,

$$ \ln(\alpha_j) = (d_{Fj} - d_{Rj}) / X_{Rj} \tag{20} $$

i.e. the MH bias statistic is equivalent to the difference
between the Rasch difficulty "PROX" estimates of the suspect item
for reference and focus groups, adjusted by the scale coefficient
$X_{Rj}$.

The standard error of this formulation of $\ln(\alpha_j)$ is

$$ S_{FRj} = \sqrt{s^2_{Fj} + s^2_{Rj}} \tag{21} $$

which becomes, after expansion,

$$ S_{FRj} = X_{Rj} \sqrt{\frac{A_j + B_j}{A_j B_j} + \frac{D_j + C_j}{D_j C_j}} \tag{22} $$

The scale coefficient $X_{Rj}$ cancels when the test statistic
for the presence of bias in interval j is formed by

$$ z_j = \frac{\ln(A_j D_j / (B_j C_j))}{\sqrt{\frac{A_j + B_j}{A_j B_j} + \frac{D_j + C_j}{D_j C_j}}} \tag{23} $$
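The per-interval bias estimate and its test statistic need only the four cell counts. The sketch below assumes the two groups are "matched" in the sense used here (equal mean abilities and equal correction factors X, so that X cancels in $z_j$); the function name `interval_bias` and the counts are hypothetical.

```python
import math

def interval_bias(a, b, c, d):
    """Bias estimate for one ability interval from its 2x2 counts.

    a, b = reference group right, wrong; c, d = focus group right, wrong.
    Returns (ln_alpha_j, se_j, z_j), where se_j is the standard error with
    the scale coefficient X divided out and z_j is the test statistic.
    """
    ln_alpha = math.log(a * d / (b * c))
    se = math.sqrt((a + b) / (a * b) + (d + c) / (d * c))
    return ln_alpha, se, ln_alpha / se

# Hypothetical counts for a single interval (not data from the paper).
ln_alpha_j, se_j, z_j = interval_bias(30, 20, 20, 30)
```

All four cells must be non-zero for the logarithm and the standard error to exist, which is one practical limit on how narrow the intervals can be made.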

Thus, if the ability distributions in the j-th interval of the
reference and focus groups are approximately normal and "matched",
to the extent that they have equal means and variances, then
the Rasch normal approximation algorithm (PROX) can be used for
estimating and testing for item bias in each interval.

Generalizing item bias across all intervals with the Rasch approach

When data fit a Rasch model, the difficulty estimates, $d_F$ and
$d_R$, become independent of the ability composition of the
reference group and focus group examinees.

Consequently the comparison of $D_F$ and $D_R$ does not require
any matching of ability levels, and their estimates
$d_F$ and $d_R$ can be calculated from all, or any convenient
subset, of the reference and focus groups without intervals or
matching.

The standard errors for $d_F$ and $d_R$ are well-defined,
and calculated during the estimation procedure as $s_F$ and $s_R$.
The standard error of the difference between the difficulty
estimates, which measures the item bias, is

$$ \mathrm{S.E.}(\ln(\alpha)) = \mathrm{S.E.}(d_F - d_R) = \sqrt{s^2_F + s^2_R} \tag{24} $$

These standard errors depend on the numbers of examinees and
their ability distributions, but are independent of the size of
the difference between $d_F$ and $d_R$, which determines the $\alpha$
estimate.

The fundamental requirement of the MH procedure is that
the probabilities of success for the reference and focus
groups bear the same relationship across all intervals.
This uniformity of relationship is required to calculate the $\alpha$
estimate. But this calculation requires the imposition of an
arbitrary segmentation and matching scheme on the two groups to
be compared. Consequently, the distribution of abilities,
selection of interval boundaries and the absolute sizes of
reference and focus groups must affect the magnitude of the $\alpha$
estimate. How then can this procedure estimate a parameter
intended to be independent of the ability range and sample size?
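The dependence on the matching scheme can be illustrated with expected counts under a Rasch model. Everything in the sketch is a hypothetical setup chosen for illustration: three ability levels, two groups with different ability distributions, and a true $\ln(\alpha)$ of 0.5. It compares the MH estimate when each ability level forms its own interval with the estimate when all levels are pooled into a single interval. The cell counts are expected values, so they are not integers.

```python
import math

def p_right(ability, difficulty):
    """Rasch probability of a right answer."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def expected_table(levels, n_ref, n_foc, d_ref, d_foc):
    """Expected (A, B, C, D) when the given ability levels share one interval."""
    a = sum(n_ref[b] * p_right(b, d_ref) for b in levels)
    c = sum(n_foc[b] * p_right(b, d_foc) for b in levels)
    n_r = sum(n_ref[b] for b in levels)
    n_f = sum(n_foc[b] for b in levels)
    return a, n_r - a, c, n_f - c

def mh_alpha(tables):
    """Mantel-Haenszel common odds-ratio estimate."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

# Hypothetical numbers of examinees at three ability levels (logits).
n_ref = {-1.0: 10, 0.0: 30, 1.0: 60}   # reference group is more able
n_foc = {-1.0: 60, 0.0: 30, 1.0: 10}
d_ref, d_foc = 0.0, 0.5                # true ln(alpha) = 0.5

true_alpha = math.exp(d_foc - d_ref)
matched = mh_alpha([expected_table([b], n_ref, n_foc, d_ref, d_foc) for b in n_ref])
pooled = mh_alpha([expected_table(list(n_ref), n_ref, n_foc, d_ref, d_foc)])
```

With exact matching the estimate reproduces `true_alpha`; with one pooled interval it is far larger, because the ability difference between the groups is absorbed into the odds ratio.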

The Rasch analysis builds on the same assumptions that the MH
procedure implies and requires, but, by utilizing all the
relevant information available from every response by the
reference and focus groups, a Rasch analysis is able to provide
a $\ln(\alpha)$ estimate of smaller, and better estimable, standard
error, that is independent of both ability distributions.

The Rasch $\ln(\alpha)$ estimate is in "logistic units", logits, and
the "delta scale" is a proportional adjustment of this logit
scale.

What does the MH significance statistic estimate?

It is not enough to calculate an estimate $\hat{\alpha}$ of the performance
difference $\alpha$ between any two groups. We must also evaluate
the statistical significance of this difference. The MH statistic
for this is

$$ \mathrm{CHISQ} = \frac{\left(\sum_j A_j - \sum_j E(A_j)\right)^2}{\sum_j \mathrm{Var}(A_j)}, \quad j = 1, \ldots, K \tag{25} $$

where

$$ E(A_j) = N_{Rj} M_{1j} / T_j \tag{26} $$

and

$$ \mathrm{Var}(A_j) = \frac{N_{Rj} N_{Fj} M_{1j} M_{0j}}{T_j^2 (T_j - 1)} \tag{27} $$

After algebraic manipulation, this becomes

$$ \mathrm{CHISQ} = \frac{\left(\sum_j (A_j D_j - B_j C_j)/T_j\right)^2}{\sum_j \mathrm{Var}(A_j)} \tag{28} $$

and further becomes

$$ \mathrm{CHISQ} = \frac{\left(\sum_j \left(A_j/N_{Rj} - C_j/N_{Fj}\right) N_{Rj} N_{Fj} / T_j\right)^2}{\sum_j \mathrm{Var}(A_j)} \tag{29} $$

But $A_j/N_{Rj}$ is an estimate of $P_{Rj}$, the probability of an
examinee in the reference group getting the item right. Similarly
$C_j/N_{Fj}$ is an estimate of $P_{Fj}$, the probability of an examinee
in the focus group getting the question right.
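The algebraic manipulation is short enough to write out. The sketch below assumes the cell labels of Figure 1, so that $N_{Rj} = A_j + B_j$, $N_{Fj} = C_j + D_j$, $M_{1j} = A_j + C_j$ and $T_j = A_j + B_j + C_j + D_j$, and takes $E(A_j) = N_{Rj} M_{1j} / T_j$, the usual Mantel-Haenszel expectation under the null hypothesis.

```latex
\begin{aligned}
A_j - E(A_j)
  &= \frac{A_j (A_j + B_j + C_j + D_j) - (A_j + B_j)(A_j + C_j)}{T_j}
   = \frac{A_j D_j - B_j C_j}{T_j}, \\[4pt]
\frac{A_j}{N_{Rj}} - \frac{C_j}{N_{Fj}}
  &= \frac{A_j (C_j + D_j) - C_j (A_j + B_j)}{N_{Rj} N_{Fj}}
   = \frac{A_j D_j - B_j C_j}{N_{Rj} N_{Fj}}, \\[4pt]
\text{so}\quad A_j - E(A_j)
  &= \left( \frac{A_j}{N_{Rj}} - \frac{C_j}{N_{Fj}} \right) \frac{N_{Rj} N_{Fj}}{T_j} .
\end{aligned}
```

Summing the first line over j gives the numerator of the second form of CHISQ, and summing the last line gives the numerator of the third.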

So, the significance statistic is estimating

$$ \mathrm{CHISQ} = \frac{\left(\sum_j (P_{Rj} - P_{Fj}) N_{Rj} N_{Fj} / T_j\right)^2}{\sum_j \mathrm{Var}(A_j)} \tag{30} $$

Thus the MH significance statistic is obtained by averaging over
different ability levels the difference between the groups of the
probability of obtaining a correct response to the item.
Figure 2 shows how, for any item which has any power to
differentiate between high and low ability examinees at all, and
for which the difference parameter $\alpha$ is not null, $P_R - P_F$ must
appear for different ability levels. Obviously no empirical mean
value can represent this difference uniquely. Its size depends on
the number, width and probability level of the intervals chosen.
Consequently the MH CHISQ is not a stable statistic. If examinees
are grouped by raw score on the test of which the item in question
was a part, the arbitrary nature of the intervals may be removed,
but the non-linear difference shown in Figure 2 remains.
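The non-linearity can be reproduced numerically. The sketch below assumes a dichotomous Rasch model with hypothetical difficulties for the two groups and evaluates $P_R - P_F$ at a handful of ability levels; the constant $\ln(\alpha)$ of 0.5 produces a probability gap that is small at both extremes and largest in between.

```python
import math

def p_right(ability, difficulty):
    """Rasch probability of a right answer."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical difficulties in logits; ln(alpha) = d_foc - d_ref = 0.5.
d_ref, d_foc = 0.0, 0.5

abilities = [-4.0, -2.0, 0.25, 2.0, 4.0]
gaps = [p_right(b, d_ref) - p_right(b, d_foc) for b in abilities]
```

Any average of `gaps` therefore depends on which ability levels are included and how they are weighted, which is the instability the text attributes to the MH CHISQ.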




[Figure 2 plot not reproduced. The horizontal axis runs from low
ability to high ability.]

Figure 2. Difference in the absolute value of probable response
to a particular item between reference and focus groups
plotted against examinee ability.

The Rasch Significance Statistic

The Rasch determination of $\ln(\alpha)$, via $d_R$ with its standard
error $s_R$ and $d_F$ with its standard error $s_F$, is
independent of analyst whim.

The statistics necessary to determine the statistical significance
of any difference between reference and focus groups are
routinely provided by Rasch analysis. The significance of an
item bias can be determined by calculating the difference between
the Rasch item difficulty estimates for the two groups, scaled
for their standard error.

The test statistic is

$$ z = (d_F - d_R) / \sqrt{s^2_R + s^2_F} $$

or, in MH terminology,

$$ z = \ln(\alpha) / \mathrm{S.E.}(\ln(\alpha)) $$

Note that the Rasch transformations have removed the effect of the
non-linearity shown in Figure 2.
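A minimal sketch of this test follows. It takes $\ln(\alpha) = d_F - d_R$, as in equation (12). The input estimates are hypothetical, and referring z to the standard normal distribution for a two-sided p-value is an assumption of this sketch, not a step stated in the paper.

```python
import math

def rasch_bias_test(d_ref, s_ref, d_foc, s_foc):
    """z statistic for the difference between two Rasch difficulty estimates.

    d_ref, d_foc = difficulty estimates of the suspect item for the
    reference and focus groups; s_ref, s_foc = their standard errors.
    Returns (ln_alpha, se, z, p_value), with a two-sided normal p-value.
    """
    ln_alpha = d_foc - d_ref
    se = math.sqrt(s_ref ** 2 + s_foc ** 2)
    z = ln_alpha / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return ln_alpha, se, z, p_value

# Hypothetical estimates in logits (not results from the paper).
ln_alpha, se, z, p_value = rasch_bias_test(d_ref=0.10, s_ref=0.12,
                                           d_foc=0.55, s_foc=0.16)
```

No intervals or matching appear anywhere in the calculation: the four inputs come straight from the two separate Rasch calibrations of the suspect item.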

Conclusion

The Mantel-Haenszel procedure is an attempt to determine
indirectly what Rasch analysis provides directly. The MH
procedure involves theoretical uncertainties and depends on
arbitrary decisions by the analyst who uses it.

If one is not prepared to accept the validity of the
Rasch model for the item under examination, the implicit
assumptions of the MH procedure will not be satisfied either.
If one is prepared to accept the Rasch assumptions, however,
the Rasch model yields simpler and better statistics.